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Abstract 

We propose a new algorithm for envelope estimation, along with a new ^/n- 
eonsistent method for eomputing starting values. The new algorithm, whieh does not 
require optimization over a Grassmannian, is shown by simulation to be mueh faster 
and typieally more aeeurate that the best existing algorithm proposed by Cook and 
Zhang (2015e). 
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1. Introduction 


The goal of envelope methods is to increase efficiency in multivariate parameter estimation 

and prediction by exploiting variation in the data that is effectively immaterial to the goals 
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of the analysis. Envelopes achieve efficiency gains by basing estimation on the variation 
that is material to those goals, while simultaneously excluding that which is immaterial. 
It now seems evident that immaterial variation is often present in multivariate analyses 
and that the estimative improvement afforded by envelopes can be quite substantial when 
the immaterial variation is large, sometimes equivalent to taking thousands of additional 
observations. 

Algorithms for envelope estimation require optimization of a non-convex objective 
function over a Grassmannian, which can be quite slow in all but small or modest sized 
problems, possibly taking hours or even days to complete an analysis of a sizable problem. 
Local optima are another complication that may increase the difficulty of the computations 
and the analysis generally. Until recently, envelope methods were available only in Matlab, 
as these computing issues hindered implementation in R. 

In this article we propose new easily computed ^n-consistent starting values and a 
novel non-Grassmann algorithm for optimization of the most common envelope objective 
function. These computing tools are much faster than current algorithms in sizable prob¬ 
lems and can be implemented straightforwardly in R. The new starting values have proven 
quite effective and can be used as fast standalone estimators in exploratory analyses. 

In the remainder of this introduction we review envelopes and describe the computing 
issues in more detail. We let P(.) denote a projection with Q(.) = I — P(.), let be the 
set of all real r x c matrices, and let be the set of all real and symmetric kxk matrices. 
If M G then span(M) C is the subspace spanned by columns of M. vec is the 
vectorization operator that stacks the columns of a matrix. A subspace 7^ C Rp is said to 
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be a reducing subspace of M G if TZ decomposes M as M = + Qt^MQt^. 

If 7?. is a reducing subspace of M, we say that 71 reduces M. 

1.1. Review of envelopes 

Envelopes were originally proposed and developed by Cook, Li and Chiaromonte (2007, 
2010) in the context of multivariate linear regression, 

Yi = cx + f3Xi +£i, i = I, ... ,n, (1) 

where Si G M'’ is a normal error vector with mean 0, variance S > 0 and is independent 
of X, CK G M’’ and /3 G is the regression coefficient matrix in which we are primarily 
interested. Immaterial variation can occur in Y or X or both. Cook et al. (2010) oper¬ 
ationalized the idea of immaterial variation in the response vector by asking if there are 
linear combinations of Y whose distribution is invariant to changes in X. Specifically, let 
P^-Y denote the projection onto a subspace C M’’ with the properties (1) the distribution 
of Q^-Y I X does not depend on the value of the non-stochastic predictor X and (2) P^: Y is 
independent of Q^-Y given X. These conditions imply that the distribution of Q^-Y is not 
affected by X marginally or through an association with P^-Y. Consequently, changes in 
the predictor affect the distribution of Y only via P^-Y and so we refer to P^-Y informally 
as the material part of Y and to Q^-Y as the immaterial part of Y. 

Conditions (1) and (2) hold if and only if (a) B := span(/3) C S (so S envelopes B) 
and (b) 8 reduces S. The S-envelope of B, denoted 8-s(B), is defined formally as the 
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intersection of all reducing subspaces of S that contain B. Let u = dim(£^s(/3)) and let 
(r, Fq) G be an orthogonal matrix with F G and span(F) = £s{^)- This leads 
directly to the envelope version of model Q, 

Yj = CK + Fr/Xi + Si, with S = TflT^ + FofioTo , i = (2) 

where (3 = Fr/, rj G gives the coordinates of /3 relative to basis F, and £l G and 
f2o G ^{r-n)x(r-u) are positive definite matrices. While rj, fl and fio depend on the basis 
F selected to represent £^.{(3), the parameters of interest (3 and S depend only on £s{(3) 
and not on the basis. All parameters in Q can be estimated by maximizing its likelihood 
with the envelope dimension u determined by using standard methods like likelihood ratio 
testing, information criteria, cross-validation or a hold-out sample, as described by Cook et 
al. (2010). The envelope estimator (3 of (3 is just the projection of the maximum likelihood 
estimator B onto the estimated envelope, [3 = P^B, and y/n{ vec(/3) — vec(/3)) is asymp¬ 
totically normal with mean 0 and covariance matrix given by Cook et al. (2010), where 
u is assumed to be known. An introductory example of response envelopes is available in 
Cook and Zhang (2015 a). 

Similar reasoning leads to partial envelopes for use when only selected columns of /3 are 
of interest (Su and Cook, 2011), to predictor envelopes allowing for immaterial variation in 
X (Cook, Helland and Su, 2013), to predictor-response envelopes allowing simultaneously 
for immaterial variation in X and Y (Cook and Zhang, 2015b) and to heteroscedastic 
envelopes for comparing the means of multivariate populations with unequal covariance 
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matrices (Su and Cook, 2013). 


Cook and Zhang (2015a) extended envelopes beyond multivariate linear models by 
proposing the following estimative construct for vector-valued parameters. Let 0 denote 
an estimator of a parameter vector 6 E Q C based on a sample of size n and assume, 
as is often the case, that y/n{0 — 6) converges in distribution to a normal random vector 
with mean 0 and covariance matrix V(0) > 0 as n —)■ cx). To accommodate the presence 
of nuisance parameters, decompose 6 as 6 = 0^)^, where 0 e W, p < m, is 

the parameter vector of interest and i/) E is the nuisance parameter vector. The 

asymptotic covariance matrix of <p is represented as V<^<^(0), which is the p x p lower 
right block of V (6). Then Cook and Zhang (2015a) defined the envelope for improving <p 
as the smallest reducing subspace of V<^<^(0) that contains span(0), £^v^^{6»)(span(0)) C 
W. This definition links the envelope to a particular pre-specified method of estimation 
through the covariance matrix V<^<^(0), while normal-theory maximum likelihood is the 
only method of estimation allowed by the previous approaches. The goal of an envelope 
is to improve that pre-specified estimator, perhaps a maximum likelihood, least squares 
or robust estimator. Second, the matrix to be reduced - here V<^<^(0) - is dictated by the 
method of estimation. Third, the matrix to be reduced can now depend on the parameter 
being estimated, in addition to perhaps other parameters. Cook and Zhang (2015a) sketched 
application details for generalized linear models, weighted least squares, Cox regression 
and described an extension to matrix-valued parameters. 
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1.2. Computational issues 


The approaches reviewed in the last section all require estimation of an envelope, now rep¬ 
resented generically as the smallest reducing subspace of M e that contains 

W C where M > 0. Let u = dim(£^M(W)), let F G be a semi-orthogonal ba¬ 
sis matrix for £^m(W), let M be a y^-consistent estimator of M, and let U be a positive 
semi-definite y^-consistent estimator of a basis matrix U for W. With u specified, the most 


common objective function used for envelope estimation is 


Lu{G) = log |G' MG| + log |G' (M + U)-^G|, 


(3) 


and the envelope is estimated as = span{argminL„(G)}, where the minimum is 

taken over all semi-orthogonal matrices G G Objective function (|^ corresponds 

to maximum likelihood estimation under normality for many envelopes, including those 
associated with ([^. Otherwise it provides a y^-consistent estimator of the projection onto 
provided M and U are y^-consistent (Cook and Zhang, 2015c, who also provided 
additional background on L„(G)). 

In the case of response envelopes reviewed in Section [Llj M is the covariance matrix of 
the residuals from the ordinary least squares fit of (j^, denoted Sy|x, and M-fU is marginal 
sample covariance matrix of Y, denoted Sy, and the envelope estimator /3 = P^B is the 
maximum likelihood estimator if the errors are normal. If the errors are not normal but 
have finite fourth moments then /3 is yn-consistent and asymptotically normal. In the 
general context of Cook and Zhang (2015a), also reviewed in Section 0 M is set to a 
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/^-consistent estimator of V<^<^(0) and U = <p<p . 

For any orthogonal matrix O G L„(G) = L„(GO), so L„(G) depends only on 

span(G) and not on a particular basis. Thus the optimization problem is over a Grassman- 
nian (See Edelman et al. (1998) for background on optimization over Grassmann mani¬ 
folds.). Since it takes u{r — u) real numbers to specify uniquely, Grassmann opti¬ 

mization is usually computationally straightforward when u{r — u) is not too large, but it 
can be very slow when u{r — u) is large. Also, since Lu{G) is non-convex, the solution re¬ 
turned may correspond to a local rather than global minimum, particularly when the signal 
is small relative to the noise. 

It is important that we have a fast and reliable method of determining argminL„(G) 
because we may need to repeat that operation hundreds or even thousands of times in an 
analysis. An information criterion like AIC or BIG is often used to select a suitable value 
for u, and this requires that we find argminL„(G) for u = 0,1,..., r. Predictive cross 
validation might also be used to select u, again requiring many optimizations of L„(G); 
repeating five fold cross validation with 50 random partitions require in total 250 x r opti¬ 
mizations. Asymptotic standard errors are available for many normal models, but we may 
wish to use a few hundred bootstrap samples to determine standard errors when normality 
is in doubt or when we wish to check the accuracy of the asymptotic approximations. And 
may more bootstrap samples may be required if we want accurate inference statements. In 
some analyses we may wish to fit a few model variations, again multiplying the computa¬ 
tion time. In cases like those discussed at the end of Section [ lT| M = V<^<^(0), which may 
depend on unknown parameters, necessitating another level of iteration for the best results 
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(See Cook and Zhang 2015a for further diseussion of this point.) In short, a seemingly 
small savings in computation time for one optimization of C„(G) can translate into mas¬ 
sive savings over the course of an analysis. Additionally, the choice of starting value for G 
can be crucial since the objective function is non-convex. Converging to a local minimum 
can negate the advantages of maximum likelihood estimation, for example. Trying several 
different starting values is not really an effective method since it again multiplies the total 
computation time and in our experience is not likely to result in the global optimum. 

Cook, Su and Yang (2014; https://github.com/emeryyi/envlp) developed a fairly com¬ 
prehensive Matlab toolbox envlp for envelope estimation based on Lippert’s sg min pro¬ 
gram for optimization over Stiefel and Grassmann manifolds (http://web.mit.edu/~ rip- 
per/www/software/). This is a very effective toolbox for small to moderate sized analyses, 
but otherwise is susceptible to all of the issues mentioned previously. Cook and Zhang 
(2015c) replaced T„(G) with a sequential ID algorithm that can be computationally much 
faster than sg min and is less dependent on good starting values. Nevertheless, it is still 
susceptible to the problems described previously, although less so than methods based on 
sg^min. Additionally, since it does not provide argminLu(G), it loses the advantages of 
that accrue with maximum likelihood estimation when normality is a reasonable assump¬ 
tion. For instance, information criteria like AIC and BIC are no longer available straight¬ 
forwardly, and likelihood ratio testing is problematic and thus dimension selection must 
typically be guided by cross validation. 

In this paper we propose an iterative non-Grassmann method to compute arg min L„(G) 
that is faster and more reliable that existing methods in large analyses and otherwise per- 
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forms about the same. It depends crueially on new effective A/n-consistent starting values 
that can also be used as standalone estimators. We restrict our comparisons to the ID 
algorithm, since Cook and Zhang (2015c) have demonstrated its superiority over direct 
optimization methods based on sgjnin. 

The new starting values are developed in Section]^ and the new algorithm, which relies 
the new starting values, is described in SectionSupporting simulation results are given in 
Section|^and contrasts on real data are given in Section]^ Proofs are given in an appendix. 

2. Starting values 

In this section we describe how to choose the u columns of the starting value for G from the 
eigenvectors of M or M + U. To gain intuition about the approach, consider the population 
representations U = TVr^, M = mT^ + TonoT'^ and (M + U)-^ = T{n + V)-^T^ + 
For^Q . For the starting values selected from the eigenvectors of M to work well, the 
eigenvalues of fl need to be well distinguished from those of CIq. If some of the eigenvalues 
of fl are close to a subset of the eigenvaues of Qq then in samples the corresponding 
eigenspaces will likely be confused when attempting to minimize Lu{G). In other words, 
we may well pick vectors near span(Fo) instead of eigenvectors near span(F) = £^m(W). 
In such cases we may obtain a better starting value by choosing from the eigenvectors of 
M + U rather than the eigenvectors of M. The same argument applies to choosing the 
starting values from the eigenvectors of M + U: the eigenvalues of 17 + V need to be 
well distinguished from those of 17o. If some of the eigenvalues of 17 + V are close to a 
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subset of the eigenvalues of fio then in samples the corresponding eigenspaces will again 
likely be confused. In such cases we may obtain better starting values by starting with the 
eigenvectors of M rather than the eigenvectors of M + U. The general conclusion from 
this discussion is that for effective starting values we will need to consider both M and 
M + U. Scaling will also be an issue, as discussed later in this section, leading to four 
potential starting values. The actual starting value used is the one that minimizes Lu{G). 
We make use of the following result. 

Proposition 2.1 Let (G, Go) G be an orthogonal matrix with G G and let M G 
be a positive definite matrix. Then log |G^MG| + log |Go MGo| and log |G^MG| + 
log|G^M-iG| are both minimized globally when the columns of G span any u dimen¬ 
sional reducing subspace o/M. 

In the next section we describe how to select starting values from the eigenvectors of 

M. 

2.1. Choosing the starting value from the eigenvectors of M 

Define Ji(G) = log|G^MG| + log|G^MGo|, J 2 (G) = log|C_„ + G^UmGqI and 
J(G) = Ji(G) + J 2 (G), where Um = is a standardized version of U. 

Assume for convenience that the eigenvalues of M are unique, which will typically hold 
with probability 1, and let Vu be the collection of all subsets of u eigenvectors of M. Then 

Proposition 2.2 argmincev^-^^«(G) = argmincev,. -^(G). 
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Consequently, instead of Lu{G) we ean work with the more amenable objeetive funetion 
J(G) = Ji(G) + J 2 (G) when restricting starting values to the eigenvectors of M. It 
follows from Proposition |2.1| that Ji(G) is minimized when the columns of G are any u 
eigenvectors of M. Restricting G G Vu, we next need to find argmincev^ J 2 {G). This 
does not have a closed-form solution and evaluating at all r-choose-w elements of Vu will 
be effectively impossible when r is large. For these reasons we replace the log-determinant 
in J 2 (G) with the trace and minimize tr(Ir_u -f Gg UmGo), which is equivalent to maxi¬ 
mizing 

U 

Km{G) := tr(G^UMG) = 

i=l 

where gj is the z-th selected eigenvector of M (the z-th column of G). Computation is now 
easy, since we just select the u eigenvectors of M that maximize gf Uivigi- 

Applying this in response envelopes, let Sx denote the marginal sample covariance 
matrix of the predictors. Then M = Sy|x, U = BSxB^, Um = Sy|x BSxB^S^jx , and 
Sy|x BSx^ is a standardized version of the ordinary least squares estimator B of /3. 

2.2. Choosing the starting value from the eigenvectors of M + U 

Define Jl{G) = log|G^(M + U)G| + log|G^(M -f U)-^G|, J 2 *(G) = log|I„ - 
G^Um+uG| and J*(G) = J*(G) + J*(G), where Um+u = (M+U)-^/ 2 fj(M+U )-^/2 

is another standardized version of U. Let V* be the collection of all subsets of u eigenvec¬ 
tors of M -f U. Then 

Proposition 2.3 argmincev* Lu{G) = argmincev; ^*(G). 
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Consequently, instead of Lu{G) we ean again work with a more amenable objeetive 


funetion, this time J*(G) = J^(G) + J 2 (G). It follows fromProposition [2!T] that Jj" (G) is 
minimized when the eolumns of G are any u eigenveetors of M + U. Restrieting G G V*, 
we next need to find arg mincev„ Again, this does not have a elosed-form solution 

and evaluating at all r-ehoose-u elements of V* will be effeetively impossible when r is 
large. For these reasons we again replaee the log-determinant with the traee and minimize 
tr(I^j — G^Um+uG), whieh is equivalent to maximizing 


-Am+u(G) tr(G^UM+uG) — gfUM+ugi, 

i=l 

where gj is the i-th seleeted eigenveetor of M -|- U (the i-th eolumn of G). Computation is 
again easy, sinee we just seleet the u eigenveetors of M -f U that maximize gf UM+ugi- 
This is exaetly the same as the previous ease, exeept the standardization of U is with 
(M + U)-i/2 instead of 

Applying this in response envelopes, M = Sy|x, U = BSxB^, M + U = Sy, 
Um+u = Sy^^^BSxB^Sy^'^^ and Sy^^^BS]/^ is just a standardized matrix of ordinary 
least squares regression eoeffieients as before. 

2.3. Scaling and consistency 

The standardized forms Um and Um+u are important when the seales involved in M and 
M+U are very different. This ean perhaps be appreeiated readily in the eontext of response 
envelopes, where M = Sy|x and M + U = Sy- In this ease the standardization will be 
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important and effective if the scales of the elements of Y are very different. However, the 
standardization will be effectively unnecessary when the scales are similar. In the case of 
response envelopes this means that the scales of the elements of Y are the same or similar. 

Depending on the scales involved, standardization can also be counterproductive when 
the sample size is not large enough to give sufficiently accurate estimates of M and U. 
In such cases, we abandon the standardization and use either i^^(G) = 
or as the objective function. The only difference between 

these is that i^^(G) confines G to the eigenvectors of M, while iTM+u(^) confines G 
to the eigenvectors of M + U. We now have four possible starting values from which to 
choose, corresponding to the arguments that minimize Kj^iG), Kl^{G), ii'M+u(G), and 
-^m+u(G)- The value Gstart chosen to start the algorithm described in Section]^ is the one 
that minimizes L„(G). 

We conclude this section with the following consistency result: 

Proposition 2.4 Let Pgtart denote the projection onto span(Gstart)- Then with known u, 
Pstart is a y/n-consistent of the projection onto £^m(W). 

3. New iterative algorithm 

In this section we describe a re-parameterized version of Lu{G) that does not require op¬ 
timization over a Grassmannian. The new parameterization requires first selecting u rows 
of G G and then constraining the matrix Gi G formed with these rows to be 
non-singular. Without loss of generality, assume that Gi is constructed from the first u 
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rows of G which we can then partition as 


G = 




y G^y 


/ \ 


V^/ 


Gi = CaGi, 


where A = G 2 G]'^ G is an unconstrained matrix and Ca = (!«, A^)^. Since 

G^G = lu and Gi is non-singular, GiGf = (C^Ca)“^- Using these relationships, 
Lu{G) can be re-parameterized as a function of only A: 


LuiA) 


;iug 


V J 




With this objective function minimization over A is unconstrained. The number of real 
parameters u{r — u) comprising A is the same as the number of reals needed to specify a 
single element in the Grassmannian Q{u, r). 

\iu{r — u) is not too large, T(A) might be minimized directly by using standard opti¬ 
mization software and the starting values described in Section In other cases minimiza¬ 
tion can be carried out by minimizing iteratively over the rows of A. Suppose that we wish 
to minimize over the last row of A. Partition 


\ 


/ \ 


f- 

- ^ 


f - 

- \ 

Ai 

, Ca = 

Cai 

, M = 

Mn 

Mi 2 

, (M + U)-i = 

Vii 

V12 

T 

^ ) 


T 


^ M21 

M22 y 


^V 2 l 

U22 y 


Then after a little algebra, the objective function for minimizing over with Ai held fixed 
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can be written up to terms that do not depend on a as 


L„(a I Ai) = -21og {l + a^(CAiCAi) 

+ log |l + M 22 (a + 

+ log {l + l/ 22 (a + + %'Ci^Vi 2 )} , 

where 

Wi = (Mh - %2'Mi2M2i) Cai 

W 2 = (Vn-V'22lVi2V2l)CA,. 

The objective function L(a | Ai) can now be minimized using any suitable off-the-shelf 
algorithm. Iteration then cycles over rows of A until a convergence criterion is met. 

This algorithm requires the starting value Ggtart described in Section Prior to ap¬ 
plication of the algorithm we must identify u rows of Ggtart and then constrain the matrix 
Gstart.M formed from those u rows to be non-singular. This implies that the matrix formed 
from the corresponding rows of a basis matrix for £^m(W) should also be non-singular. This 
can be achieved reliably by first applying Gaussian elimination with partial pivoting to 
Gstart- The u rows of Gstart identified during this process then form Gstart,«- 

Proposition 3.1 Assume that the eigenvalues o/M and M -f U are distinct. Then the uxu 
submatrix 0 /Gstart that consists of the u rows selected by Gaussian elimination converges 
to a non-singular matrix with rate y/n. 
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This proposition shows that asymptotically Gaussian elimination produces a non-singular 
submatrix. The eondition that the eigenvalues of M and M + U be distinct is mainly for 
clarity of exposition and is not neeessary. The proof given in the appendix demonstrates a 
more eomplete result. Let Fstart denote the population version of Gstart, and let Fgtart.n G 
^uxu consist of the u rows of Fgtart formed by applying Gaussian elimination to Fgtart- 
Then Fgtart is a basis matrix for £^m(W) and Ggtart.n converges to Fgtart.u at rate y/n. 

The new algorithm estimates a basis F row by row, while the ID algorithm optimizes 
eolumn by column. When u is small, the ID algorithm tends to be a bit more efficient as it 
optimizes one eolumn at a time and it needs only one pass through those eolumns. When 
u is larger, the new algorithm dominates, and sometimes substantially. In eaeh estimation, 
the ID algorithm uses conjugate gradient with Polak-Ribiere updates while our algorithm 
uses Newton updates. 

4. Simulations 

4.1. starting values 

The first series of simulations was designed to illustrate why it is important to eonsider the 
eigenvalues of both M and M + U. All simulations are for response envelopes reviewed 
in Seetion fTrj model Q. The results displayed in the tables of this section are the average 
over 50 replieations in eaeh simulation seenario. The angle Z(span(Ai), span(A 2 )) be¬ 
tween the subspaees spanned by eolumns of the semi-orthogonal basis matriees Ai G 
and A 2 G W'^'^ was eomputed in degrees as the arc cosine of the smallest absolute singu- 
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lar value of A^B, and (3. 


start 


start 


B, where Pstart is as defined in Proposition 


2.4 


The starting value is still denoted as Gstart but its definition depends on the simulation. 


r = argminL(G) was obtained from the new algorithm deseribed in Seetion [fusing the 
simulation-speeifie starting value Gstart, and EmiU) = span(r). 


Scenario I. This simulation was designed to illustrate a regression in whieh the eigen¬ 
values of S are elose and the signal is strong. We generated the data with p = r = 100, 
n = 500 and u = 20, taking Cl and CIq to be diagonal matriees with diagonal elements gen¬ 
erated as independent uniform (49, 51) variates. Elements in rj were independent uniform 
(0,10) variates, X followed a multivariate normal distribution with mean 0 and eovarianee 
matrix 400Ip, and the elements of (P, Pq) G were obtained by standardizing a matrix 
of independent uniform (0,1) variates. In this seenario, the eigenvalues of S are elose to 
eaeh other, but we have a strong signal arising from the distribution of X. Starting values 
based on the eigenveetors of M = Sy|x were expeeted to perform poorly, while starting 
values based on M -|- U = Sy were expeeted to perform well, as eonjeetured at the start 
of Seetion|^and eonfirmed by the results in Table 

The overarehing eonelusion from Table is that the starting values from Sy did very 
well, whether U was standardized or not, while the starting values from Sy|x were effee- 
tively equivalent to ehoosing a 20-dimensional subspaee at random. Additionally, iteration 
from the starting value produeed essentially no ehange in the angle, the value of the objee- 
tive funetion or the envelope estimator of /3. 
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Standardized U 

Unstandardized U 

Summary statistic 

Sy = M + U 

Sy|x = M 

Sy = M + U 

Sy|x = M 

Z(span(Gstart),^^M(W)) 

0.58 

89.05 

0.58 

88.98 


0.58 

88.58 

0.58 

88.74 

-^'u(Grstart) 

-182.10 

-13.18 

-182.10 

-9.94 

Lu{T) 

-182.10 

-21.95 

-182.10 

-20.01 

il3start -/3||2 

0.27 

149.58 

0.27 

136.51 

I|3-/3||2 

0.27 

113.02 

0.27 

101.67 


Table 1: Results for Seenario 1. The starting value Gstart was eonstructed from the eigen¬ 
vectors of the matrices indicated by the headings for columns 2-5. 


Scenario II. We generated data with p = r = 100, n = 500 and u = 5, taking 17 = I„ 
and 17o = lOOI^-u. Elements in rj were independent uniform (0,10) variates, X followed 
multivariate normal distribution with mean 0 and covariance matrix 25Ip, and (F, Fq) was 
obtained by standardizing an r x r matrix of independent uniform (0,1) variates. Since the 
eigenvalues in 17 and 17o are very different and the signal is modest, the results in Table 
show as expected from the argument given in Section 2 that the starting values based on 
M = Sy|x did much better than those based on Sy- As in Scenario I, the starting value did 
very well. Iteration improved the starting value a small amount and scaling had no notable 
affect. 


Scenario III. The intent of this simulation is to demonstrate the importance of scaling 
U. We generated data with p = r = 30, n = 200 and u = 5, taking 17 to be a diagonal 
matrix with diagonal elements 1.5^, ■ ,1.5“ and 17o to be a diagonal matrix with diagonal 

elements 1.5“’''^, ■ ■ ■, 1.5''. Elements in rj were generated as independent uniform (0,10) 
variates, X follows the multivariate normal distribution with mean 0 and covariance matrix 
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Standardized U 

Unstandardized U 

Summary statistic 

Sy = M + U 

Sy|x = M 

Sy = M + U 

Sy|x = M 

Z(span(Gstart),^^M(W)) 

45.88 

3.87 

45.88 

3.87 


36.55 

3.78 

36.55 

3.78 

-^'u( Gstart) 

-16.19 

-30.88 

-16.19 

-30.88 

Lu{T) 

-20.74 

-30.95 

-20.74 

-30.95 

il3start -/3||2 

1.93 

0.66 

1.93 

0.66 

I|3-/3||2 

1.64 

0.57 

1.64 

0.57 


Table 2: Results for scenario 11. The starting value Gstart was constructed from the eigen¬ 
vectors of the matrices indicated by the headings for columns 2-5. 


lOOIp, and (F, Fq) = Ir. We see from the results of Tablej^that standardization performs 
well and that now iteration improves the starting value considerably. Here and in the other 


results of this section, the smallest value of T.u(Gstart) produced best results. 



Standardized U 

Unstandardized U 

Summary statistic 

Sy = M + U 

Sy|x — M 

Sy = M + U 

Sy|x — M 

Z(span(Gstart), ^^m(7/)) 

48.63 

16.72 

89.35 

33.31 

Z(^m(W),^m(W)) 

17.92 

1.54 

89.34 

22.77 

-^'u( Gstart) 

-13.43 

-35.75 

-12.69 

-34.09 

LuiT) 

-32.48 

-46.93 

-23.26 

-44.84 

Il3start -/3||2 

11.82 

8.56 

20.32 

11.13 

I|3-/3||2 

4.37 

0.72 

20.17 

5.39 


Table 3: Results for scenario III. The starting value Gstart was constructed from the eigen¬ 
vectors of the matrices indicated by the headings for columns 2-5. 


Scenario IV. For this simulation we kept the same settings as Scenario III, except that 
diagonal elements of fl and fio were 1.05^, • • •, 1.05“ and , 1.05^, and (F, Fq) 

was generated by standardizing a matrix of uniform (0,1) random variables. In this setup 
heteroscedasticity across the elements is reduced substantially from that in scenario III. 
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As indicated in Table the standardization no longer provides much improvement. Also, 


since the eigenvalues of and flo are similar, S'y|x again does not work well. 



Standardized U 

Unstandardized U 

Summary statistic 

Sy = M + U 

Sy|x — M 

Sy = M -f U 

Sy|x — M 

Z(span(Gstart),^^M(W)) 

0.30 

79.57 

0.30 

80.66 


0.30 

73.40 

0.30 

75.58 

-^'u( Gstart) 

-53.54 

-7.92 

-53.54 

-7.27 

Lu{T) 

-53.54 

-13.40 

-53.54 

-12.56 

Il3start -/3||2 

0.08 

33.36 

0.08 

31.59 

I|3-/3||2 

0.08 

25.04 

0.08 

22.61 


Table 4: Results for scenario IV. The starting value Gstart was constructed from the eigen¬ 
vectors of the matrices indicated by the headings for columns 2-5. 


4.2. Comparisons with the ID algorithm 

In this section we give three different simulation scenarios based on response envelopes 
for comparing the new non-Grassmann algorithm with the ID algorithm. In all scenarios 
p = 100, CK = 0, orthogonal bases (r,ro) were obtained by normalizing an r x r ma¬ 
trix of independent uniform (0,1) variates, the elements in r; G were generated as 
independent uniform (0,10) variates, and /3 = Trj. The predictors X were generated as 
independent normal random vectors with mean 0 and variance 4001.^. We varied u from 
1 to 90 and recorded and computing times and the angles between the true and estimated 
subspaces. 

The ID algorithm was implemented in R for all simulations reported in this and the 
next section. Using efficient programming tools in R, it is now much faster than its Matlab 
version, which produced the results in Cook and Zhang (2015c). To insure a fair compar- 
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ison, we used the default convergence criterion in R for optimizations within both the ID 
algorithm and the new algorithm. The angle between subspaces was computed as described 
previously. In all case the results tabled are the averages over 50 replications. We use Fid 
to denote the basis generated by the the ID algorithm. 

Scenario V. In this scenario we set r = 100 and n = 250. To reflect multivariate re¬ 
gressions with large immaterial variation, so envelopes give large gains, we generated the 
error covariance matrix as S = rnr^ + TonoT'^, where Cl = AA^, Qq = BB^, the 
elements in A were generated as independent standard normal variates and elements in B 
were generated as independent normal (0,5^) variates. The results are shown in Table 
The ID algorithm tends to perform a bit better on accuracy (Tablefor small values of 
u, while performing poorly for large values of u. The same phenomenon occurs in terms of 
time: the ID algorithm tends to be a bit faster for small values of u, but otherwise can take 
much longer than the new non-Grassmann algorithm. The relatively small times for the 
new algorithm at m = 5,10, 20, 60 occurred because in those cases the starting value was 
quite good and little iteration was required. The same qualitative differences hold when 
considering the norm between the estimated coefficient matrix and the true value from the 
simulation. Note also that the angle for the starting value by itself was often smaller than 
that for the ID algorithm. 

Scenario VI. We again set r = 100 and n = 250. To reflect multivariate regressions 
with small immaterial variation, so envelopes give worthwhile but relatively modest gains, 
we generated the error covariance matrix as S = FfiF^ + Forioro > where f2 = AA^, 
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(A) Angle 

£m{k) 

span(Gstart) 

span(fiD) 

u = 1 

0.92 

1.68 

0.64 

u = 5 

3.56 

3.60 

1.81 

u = 10 

4.67 

4.73 

4.60 

u = 20 

5.83 

5.84 

42.77 

u = 30 

4.84 

6.07 

12.37 

u = 40 

5.59 

7.39 

6.24 

u = 50 

6.81 

7.62 

39.57 

u = 60 

8.48 

8.49 

70.37 

u = 80 

7.61 

10.01 

25.51 

u = 90 

7.15 

12.04 

21.02 

(B) Time 

^m(77) 

span(Gstart) 

span(fiD) 

u = 1 

2.30 

0.03 

0.23 

u = 5 

0.19 

0.03 

1.45 

u = 10 

0.37 

0.03 

2.71 

u = 20 

0.34 

0.03 

5.16 

u = 30 

7.49 

0.04 

6.23 

u = 40 

7.58 

0.04 

7.30 

u = 50 

5.53 

0.05 

9.18 

u = 60 

0.97 

0.05 

10.59 

u = 80 

2.21 

0.07 

11.07 

u = 90 

1.55 

0.08 

10.40 


Table 5: Scenario V: (A) Angle between EmiU) and the indicated subspace. (B) Computing 
time in seconds for the indicated subspace. span(Gstart) and span(riD) denote the 

estimated subspaces by the new non Grassmann algorithm, the starting values described in 
Section 1^ and the ID algorithm. 

fio = BB^, the elements in A were generated as independent normal (0,5^) variates 
variates and elements in B were generated as independent standard normal variates. The 
results shown in Table broadly parallel those in Table for Scenario V, but now the 
performance of the new algorithm is stronger, both in terms of accuracy and time. 
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(A) Angle 

Sm{K) 

span(Gstart) 

span(fiD) 

u = 1 

0.32 

0.32 

0.33 

u = 5 

0.75 

0.75 

0.70 

u = 10 

0.89 

0.89 

2.94 

u = 20 

1.13 

1.14 

21.00 

u = 30 

1.24 

1.24 

10.73 

u = 40 

1.36 

1.36 

12.68 

u = 50 

1.40 

1.40 

16.97 

u = 60 

1.45 

1.45 

31.20 

u = 80 

1.64 

1.64 

6.67 

u = 90 

1.14 

1.14 

4.10 

(B) Time 


span(Gstart) 

span(fiD) 

u = 1 

0.08 

0.04 

0.30 

u = 5 

0.10 

0.03 

1.13 

u = 10 

0.18 

0.03 

2.52 

u = 20 

0.29 

0.04 

3.82 

u = 30 

0.42 

0.04 

6.42 

u = 40 

0.71 

0.04 

7.71 

u = 50 

0.38 

0.05 

9.74 

u = 60 

0.31 

0.06 

10.62 

u = 80 

0.61 

0.07 

11.69 

u = 90 

0.21 

0.09 

11.00 


Table 6: Scenario VI: (A) Angle between EmiU) and the indicated subspace. (B) Com¬ 
puting time in seconds for the indicated subspace. span(Gstart) and span(riD) 

denote the estimated subspaces by the new non Grassmann algorithm, the starting values 
described in Sectionand the ID algorithm. 

Scenario VII. This scenario was designed to emphasize the time differences between the 
ID algorithm and the non Grassmann algorithm. We set n = 500 and varied r from 150 
to 350. The error covariance matrix was constructed as S = + ToflorQ, where 

f2 = I, rio = 251. The results in Table indicate that estimative performance of the two 
algorithms is essentially the same. However, as shown in Table the ID algorithm can 
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take considerably longer than the non Grassmann algorithm. To emphasize the differences, 
the ID algorithm with r = 350 would take about 2.5 hours to estimate the envelope for 
each u between 1 and 90, while the non Grassmann algorithm would take only about 0.15 
hours. In practice we would normally need to estimate the envelope for each u between 1 
and 350, leading to much longer computing times. 

5. Contrasts on real data 

In this section we compare the computing time for the new non Grassmann algorithm and 
the ID algorithm to select an envelope dimension by minimizing prediction errors from five 
fold cross validation, the method typically used in conjunction with the ID algorithm. The 
time reported is, for each u, the total optimization time over 250 optimizations comprised 
of 50 replications of five fold cross validation. 

5.1. Alzheimer data 

The Al z heimer data contains volumes of r = 93 regions of the brain from each of 749 
Alzheimer patients (Zhu et al. (2014)). We used gender, age, the logarithm of intracere- 
broventricular volume, and interactions involving gender as predictors. After taking the 
logarithms of all brain volumes, we fitted the response envelope model using both the new 
algorithm and the ID algorithm. There was little to distinguish methods based on predictive 
performance, but the time differences are clear, as displayed in Figure As we observed 
in the simulations, the times for the two algorithms are close for relatively small values of 
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(A) 

Angle 

r 

Sm{K) 

= 150 

span(riD) 

r 

= 250 

span(riD) 

r 

= 350 

span(riD) 

u = 1 

0.29 

0.29 

0.34 

0.34 

0.42 

0.42 

u = 5 

0.67 

0.67 

0.78 

0.78 

1.02 

1.03 

u = 10 

0.75 

0.75 

0.99 

1.00 

1.12 

1.13 

u = 20 

0.96 

0.96 

1.14 

1.14 

1.44 

1.46 

u = 30 

1.14 

1.14 

1.59 

1.61 

1.73 

1.76 

u = 40 

1.31 

1.31 

1.78 

1.78 

2.13 

2.16 

u = 50 

1.69 

1.68 

1.95 

1.95 

2.46 

2.50 

u = 60 

2.00 

1.98 

2.94 

2.93 

3.16 

3.18 

u = 80 

2.87 

2.74 

5.20 

4.94 

5.72 

5.66 

u = 90 

5.49 

4.30 

8.27 

7.01 

10.93 

9.53 

(B) 

Time 

r 

= 150 

span(riD) 

r 

Sm(M) 

= 250 

span(riD) 

r 

= 350 

span(riD) 

u = 1 

0.16 

0.18 

0.45 

0.64 

0.96 

1.65 

u = 5 

0.21 

0.85 

0.54 

3.30 

1.08 

8.23 

u = 10 

0.28 

2.26 

0.64 

8.31 

1.22 

20.89 

u = 20 

0.42 

6.16 

0.94 

23.4 

1.64 

51.61 

u = 30 

0.62 

9.56 

1.33 

37.00 

2.18 

81.46 

u = 40 

0.86 

12.94 

1.83 

48.45 

2.86 

110.06 

u = 50 

1.09 

16.03 

2.38 

59.20 

3.73 

135.05 

u = 60 

1.40 

18.62 

3.13 

68.23 

4.85 

157.65 

o 

00 

2.08 

22.50 

9.77 

87.08 

15.07 

196.84 

M = 90 

2.50 

23.91 

11.74 

91.20 

27.97 

212.87 


Table 7: Scenario VII. (A) Angle between and the indicated subspace. (B) Comput¬ 
ing time in seconds for the indicated subspace. and span(riD) denote the subspaces 

by the new non Grassmann algorithm and the ID algorithm. 

u and diverge for larger values of u. The total optimization time over all 250 x r = 23, 250 
optimizations was about 22 hours for the new algorithm and 60 hours for the ID algorithm. 
The overall computation time is relatively large because the signal in the data is somewhat 
weak. 
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The value of u selected by the new algorithm was u = 17, while the ID algorithm 
selected u = 6. The average prediction errors estimated by the two methods at their own 
values of u were essentially the same. 


Alzheimer 



U 


Figure 1: Alzheimer data: for each u the vertical axis is the total optimization time over 
250 optimizations comprised of 50 replications of five fold cross validation. The solid line 
marks the new non Grassmann algorithm and the dashed line marks ID algorithm. 


5.2. Glass data 

Our algorithm is also applicable in envelope contexts other than response envelopes. We 
used predictor envelopes (Cook et al. (2013)) for this illustration. 
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The dataset contains measurements of the chemical composition and electron-probe-X- 
ray microanalysis for 180 archeological glass vessels from 15th to 17th century excavated in 
Antwerp, Belgium. For each vessel, a spectrum on a set of equispaced frequencies between 
1 and 1920 is measured, however the values below 100 and above 400 are almost null. 
Following Kudraszow and Maronna (2011), we chose 13 equispaced frequencies between 
100 and 400 as predictors. The response variable is the amount of sulfur trioxide. For each 
u = 1,...,13, we ran the ID algorithm and the new algorithm, recoding the prediction 
error from 50 replications of five fold cross validation and the average computing time for 
these 250 optimizations. The new algorithm gave a four percent improvement in prediction 
error over the ID algorithm at u = 3, which was best for both methods. As in the Alzheimer 
data, there were clear differences in computing time, as shown in Figure]^ The total time 
for computing all u was 86 seconds for the new algorithm and 541 seconds for the ID 
algorithm. 
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Glass vessel data 



u 


Figure 2: The solid line marks the new non Grassmann algorithm and the dashed line marks 
ID algorithm. 

References 

[1] Cook, R.D., Holland, 1. S. and Su, Z. (2013), Envelopes and partial least 
squares regression. Journal of the Royal Statistical Society B 75 , 851-877. 

[2] Cook, R.D., Li, B. and Chiaromonte, F. (2007). Dimension reduction in re¬ 
gression without matrix inversion. Biometrika 94 , 569-584. 

[3] Cook, R.D., Li, B. and Chiaromonte, F. (2010). Envelope models for parsi¬ 
monious and efficient multivariate linear regression (with discussion). Statis- 


28 




tica Sinica, 20, 927-1010. 


[4] Cook, R. D., Su, Z. and Yang, Y. (2014). A MATLAB toolbox for computing 
envelope estimators in multivariate analyses. Journal of Statistical Software 
62, http://www.jstatsoft.org/v62/i08/paper, 

[5] Cook, R.D. and Zhang, X. (2015a). Foundations for envelope models and 
methods. Journal of the American Statistical Association 110, 599-611. 

[6] Cook, R.D. and Zhang, X. (2015b). Simultaneous envelopes for multivariate 
linear regression. Technometrics 57, 11-25. 

[7] Cook, R.D. and Zhang, X. (2015o). Algorithms for envelope esti¬ 
mation. Journal of Computational and Graphical Statistics, to appear 
(http://arxiv.org/pdf/1403.4138 .pdf) 

[8] Edelman, A., Arias, T. A. and Smith, S. T. (1998). The geometry of algo¬ 
rithms with orthogonality eonstraints. SIAM Journal on Matrix Analysis and 
Applications 20, 303 - 353. 

[9] Kudraszow, N. L. and Maronna, R. A. (2011). Estimates of MM type for the 
multivariate linear model. Journal of Multivariate Analysis, 102 1280-1292. 

[10] Su, Z. and Cook, R.D. (2011), Partial envelopes for efficient estimation in 
multivariate linear regression. Biometrika, 98, 133-146. 

[11] Su, Z. and Cook, R.D. (2013). Estimation of multivariate means with het- 
eroseedastie errors using envelope models. Statistica Sinica, 23, 213-230. 


29 


[12] Zhu, H., Khondker, Z., Lu, Z., and Ibrahim, J. G. (2014). Bayesian Gen¬ 
eralized Low Rank Regression Models for Neuroimaging Phenotypes and 
Genetic Markers. Journal of the American Statistical Association, 109, 977- 
990. 


30 



Appendix 


A. 


Proof of Proposition 



Let (G, Go) G be a column partitioned orthogonal matrix and let M G be 
positive definite. The conclusion that log |G^MG| + log |GoMGo| is minimized when 
span(G) is any u-dimensional reducing subspace of M will follow by showing that |M| < 
|G^MG| X |Go MGol with equality if and only if span(G) reduces M. 


M 


< 


|(G,Go)^M(G,Go)| = 
|G^MG| X IG^MGo- 

IG^MGil X IG^MGol, 


G^MG G^MGo 

G^MG G^MGo 
G;^MG(G^MG)-^G^MGo 


with equality if and only if Gq MG = 0, which is equivalent to requiring that span(G) 
reduce M. 

The conclusion that log |G^MG| + log |GM“^G| is also minimized when span(G) is 
any w-dimensional reducing subspace of M follows because 


log |G^MG| + log IGM-^GI = log |G'^MG| + log |GoMGo| - log |M 
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B. 


Proof of Proposition 


2.2 


Recall that Ji(G) = log |G^MG| + log |G^MGo|, J 2 (G) = log |I^_„ + G^UmGqI and 
J(G) = Ji(G) + J 2 (G), where Um = is a standardized version of U. 

Then from Proposition |2.1[ an argument minimizes T„(G) if and only if it minimizes 


/(G) = log|G^MG|+log|G^(M + U)Go| 

= log|G'^MG|+log|G^(M + uu^)Go| 

= log |G^MG| + log |G^MGo| + log + u'^Go(G^MGo) ^G^u| 

= log|G^MG|+log|G^MGo| 

+ log \lr-u + (G^MGo)-'/'G^{iii^Go(G^MGo)-'/'| 

= Ji(G) + /2(G), 

where /2 is defined implicitly and U = uu^ is a decomposition of U with u G To 

see that /2 = J2 over we have 


/ 2 (G) = log|I,_„ + (G^MGo)-'/2G^UGo(G^MGo)-'/2| 

= log \lr-u + (G^MGo)-^/^G^M^/2^M-^/2uM-1/2)M^/2q^^qT^q^)-1/2 
= log|I,_„ + G^(M-'/2uM-V2)Go| 

= log |Ir-u + G^UmGoI 
= ^2(G), 
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where the third equality follows beeause Go G Vu reduees M. 


c. 


Proof of Proposition 


2.3 


Let W = M + U for notational eonvenienee and start with the objeetive funetion 


L„(G) = log|G^MG|+log|G^(M + U)^iG| 

= log|G^MG|+log|G^W-^G| 

= log|G^(M + U)G-G^UG|+log|G^W”^G| 

= log|G^WG - G^GG^GI+log|G^W-iG| 

= log |G^WG| + log |Ifc - u^G(G^WG)-^G^u| + log jG^W^^G 


where u is as defined in the proof of Proposition |2.2[ The sum of the first and last terms 
on the right side of this representation is always non-negative and equals 0, its minimum 
value, when the eolumns of G span any redueing subspaee of W = M + U. Restrieting 
G in this way, 


u^G(G'^WG)'^G^u = 

= U^W-^/2QQr^-l/2:Q^ 
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and the middle term of L(G) reduees to 


log |I - Q'^G(G^(M + U)G)-^G^ii 


log \lk - G^W-i/^GG'^W-^/^QI 
log |I„ - G^W-^/^uG^W-^/^qi 
log|I,-G^UM+uG|, 


where Um+u = = y^-i/ 2 -(jy^-i /2 -Q standardized by ^ 

(M + U)-i/2. 


D. 


Proof of Proposition 


2.4 


We demonstrate the result in detail for Km- The eorresponding result for the other three K 
funetions follows similarly. 

Reeall that ii'M(G) = where g* is an eigenveetor of M. 

The population version of this objeetive funetion is 


KMiG) = 5^(gfM-V2uM-i/2g^) 
i=l 

where g is an eigenveetor of M and G = (gi,..., g„). We next show that 


span ^argmaxiTM(G)j 


Consider a generie envelope £a{S), where A > 0 with eigenspaees Ai, i = 1,... ,q. 
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Cook, et al. (2010) show that this envelope ean be eharaeterizes as £^a(‘5) = 

As a eonsequenee there are u = dim(£^A(‘5)) orthogonal eigenveetors ai,... a„ of A so 
that 

U / ^ 

^a(‘S) = ^ Pa,*S = span j ^ Pa,SS^Pa, 

i=l V*=l 

where s is a basis matrix for S. By definition of Sa{<S), there exists exaetly u eigenveetors 
of A that are not orthogonal to S and these eigenveetors are ai,... a„. Consequently, we 
must have 


£a{.S) = span 


arg max tr 



U 


= span I arg max 2_^ ss^ v,; 

i=l 


where the maximum is taken over the eigenveetors Vj of A. Equality holds sinee the max¬ 
imum must seleet u eigenveetors of A that are not orthogonal to ss^. 

Comparing this general argument with K{G) we see that arg max will seleet u 
eigenveetors of M that are not orthogonal to and eonsequently 


span ^argmax^M(G)j = £^M(span(M 


where the final equality follows from Cook et al. (2010, Prop. 2.4). 

The ^/n eonsisteney now follows straightforwardly sinee the matriees involved in the 
determination of the four potential starting values - M, M + U, Um and Um+u - are all 
ydr-eonsistent estimators of their eorresponding population versions. 
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E. 


Proof of Proposition 



Let Fstart & denote the population eounterpart of Gstart- Based on previous discus¬ 
sion, the columns of r^tart are eigenvectors of M or M -|- U. Since rank(rstart) = m we can 
find u linearly independent rows of Fstart and, letting F^i denote the u x u matrix forms by 
these u rows, we get |F„| ^ 0. Now, let denote the submatrix of Gstart forms by these 
same u rows. It follows straightforwardly in the manner of Proposition [2^ that G„ is a ^/n 
consistent estimator of F„. Since the determinant is a continuous function this implies that 
for n sufficiently large |Gu| 7 ^ 0 with a specified high probability. As a consequence, for n 
sufficiently large, rank(Gstart) = u with arbitrarily high probability. 

Perform Gaussian elimination with partial pivoting on Gstart and denote the resulting 
u X u submatrix by Gstart, From the preceding discussion, Gstart,« is nonsingular with 
high probability for sufficiently large n. Also, perform Gaussian elimination with partial 
pivoting to Fstart and denote the resulting nonsingular u x u submatrix by Fstart,tj- The 
proposition is then established if Gstart,« is a y/n consistent estimator of Fstart,«- 

First we assume that the pivot elements for Fstart are unique and occur in rows r^, 
i = 1 ,... ,M. In the first step of Gaussian elimination, for an arbitrary e > 0 , we can 
find an Ni such that when n > Ni, the corresponding element in row ri of Gstart is 
the one having the largest absolute value with probability at least 1 — e. In other words, 
row ri will be selected in Gstart with probability at least 1 — e. We call the resulting 
matrices Fstart ,1 £ and Gstart ,1 G As Gaussian elimination involves only 

simple arithmetic operations, Gstart ,1 converges to Fstart ,1 at rate y/n. Now, for the second 
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step in Gaussian elimination, we do partial pivoting in the second columns of Gstart.i and 
Tstart,!- Then, for an arbitrary e > 0, we can find an N2 > Ni such that when n > N2, the 
elements chosen for Gstart.i and Fstart,! will be the same with probability at least (1 — e). 

Continuing this process, for n > Nu, rows ri,..., in Gstart are selected with prob¬ 
ability at least (1 — e)“. Let || ■ || denote some matrix norm. As Gstart converges to Fstart 
with rate ^/n, we have i|Gstart,« — Fstart,«|| = and consequently for any e > 0 

there exists K > 0 and Nq so that for all n > Nq, 


P ^V^ll Gstart,It Fstart,rr II ^ ^ 


rows ri,... are selected^ < e. 


Then with n > max(A"o, Ny), 


P ( V^lI Gstart,n Fstart,rj|| ^ -^) 

< P V^l I Gstart,n Fstart,u|| ^ P 


rows ri,... are selectedJ * P(rows ri,... are selected) 


+P(not all rows ri,... are selected) 


< e -f [1 - (1 - e)“]. 


Since e > 0 is arbitrary and e -f [1 — (1 — e)“] tends to 0 as e tends to 0, Gstart,« converges 
to Fstart,n at rate ^/n. 

To deal with non-unique pivot elements, assume that there are ties in one column. When 
we perform Gaussian elimination with partial pivoting on Fstart in the step with k ties, we 
can choose whichever of the tied elements, resulting in all the cases in non-singular matri- 
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ces. We call the resulting matrices Ai,..., A^. When Gaussian elimination was perform 
with partial pivoting on Gstart, using the preceding reasoning, there will be probability at 
least (1 — e )“/k that we pick the rows in Aj, i = 1,..., /c. Then A converges to Ai, A 2 ... 
or Afc with rate ^/n, so A converges to a non-singular matrix with rate ^/n. If we have ties 
in more than one step we divide further probabilities, since the number of the steps and u 
are fixed the proof flows similarly. 
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